June 2004 Proceedings of the 2004 ACM SIGMOD international conference on 
Management of data 

Full text available: pdf(188.52 KB) Additional Information: full citation, abstract, references 

As more and more query processing work can be done in main memory, memory access is becoming 
a significant cost component of database operations. Recent database research has shown 
that most of the memory stalls are due to second-level cache data misses and first-level 
instruction cache misses. While a lot of research has focused on reducing the data cache 
misses, relatively little research has been done on improving the instruction cache 
performance of database systems. We first answer the question "Why ... 
Stack caching for interpreters 
M. Anton Ertl 

June 1995 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 1995 conference 
on Programming language design and implementation, volume 30 issue 6 

An interpreter can spend a significant part of its execution time on accessing arguments of 
virtual machine instructions. This paper explores two methods to reduce this overhead for 
virtual stack machines by caching top-of-stack values in (real machine) registers. The 
dynamic method is based on having, for every possible state of the cache, one specialized 
version of the whole interpreter; the execution of an instruction usually changes the state of 
the cache and the next i ... 
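
As a rough illustration of the idea of caching top-of-stack values, the sketch below 
compares a plain stack interpreter with one that keeps the top item in a local variable 
standing in for a machine register. It is a minimal sketch under stated assumptions, not 
the paper's method: the instruction set (push, add, mul), the function names, and the 
single-item cache are all hypothetical, and the multi-state dynamic and static schemes 
the abstract mentions are not modelled. 

```python
def run_plain(program):
    """Stack VM that keeps every operand on the in-memory stack."""
    stack, mem_ops = [], 0
    for op, arg in program:
        if op == "push":
            stack.append(arg)
            mem_ops += 1
        elif op in ("add", "mul"):
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b if op == "add" else a * b)
            mem_ops += 3  # two pops and one push
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stack[-1], mem_ops


def run_cached(program):
    """Same VM, but the top-of-stack value lives in `tos` (the 'register')."""
    stack, mem_ops = [], 0
    tos, cached = None, False  # cache state: is a value held in tos?
    for op, arg in program:
        if op == "push":
            if cached:
                stack.append(tos)  # spill the old top to memory
                mem_ops += 1
            tos, cached = arg, True
        elif op in ("add", "mul"):
            a = stack.pop()  # second operand still comes from memory
            mem_ops += 1
            tos = a + tos if op == "add" else a * tos
        else:
            raise ValueError(f"unknown instruction: {op}")
    return tos, mem_ops


# (2 + 3) * 4, written for the hypothetical instruction set above
program = [("push", 2), ("push", 3), ("add", None), ("push", 4), ("mul", None)]
print(run_plain(program))   # (20, 9)
print(run_cached(program))  # (20, 4)
```

Both interpreters compute the same result; the cached version touches the in-memory 
stack less often because one operand of each binary operation is already in the cache. 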



http://portal.acm.org/resultsxfha?coll=ACM&dl=ACM&CFro=30425275& 11/3/04 



Results (page 1): cache probe instruction 



Page 3 of 6 



Trace cache: a low latency approach to high bandwidth instruction fetching 
Eric Rotenberg, Steve Bennett, James E. Smith 

December 1996 Proceedings of the 29th annual ACM/IEEE international symposium on 
Microarchitecture 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index 
Publisher Site 

As the issue width of superscalar processors is increased, instruction fetch bandwidth 
requirements will also increase. It will become necessary to fetch multiple basic blocks per 
cycle. Conventional instruction caches hinder this effort because long instruction sequences 
are not always in contiguous cache locations. We propose supplementing the conventional 
instruction cache with a trace cache. This structure caches traces of the dynamic instruction 
stream, so instructions that are otherwise no ... 

Keywords: instruction cache, instruction fetching, multiple branch prediction, superscalar 
processors, trace cache 



9 Partitioned instruction cache architecture for energy efficiency 

Soontae Kim, N. Vijaykrishnan, Mahmut Kandemir, Anand Sivasubramaniam, Mary Jane Irwin 
May 2003 ACM Transactions on Embedded Computing Systems (TECS), volume 2 issue 2 

Full text available: pdf(817.81 KB) Additional Information: full citation, abstract, references, index terms 

The demand for high-performance architectures and powerful battery-operated mobile 
devices has accentuated the need for low-power systems. In many media and embedded 
applications, the memory system can consume more than 50% of the overall system 
energy, making it a ripe candidate for optimization. To address this increasingly important 
problem, this article studies energy-efficient cache architectures in the memory hierarchy 
that can have a significant impact on the overall system energy ... 

Keywords: Caches, energy, memory system 



10 Architectural and compiler support for effective instruction prefetching: a cooperative 
approach 

February 2001 ACM Transactions on Computer Systems (TOCS), volume 19 issue 1 

Full text available: pdf(432.96 KB) Additional Information: full citation, abstract, references, citings, index 
terms, review 

Instruction cache miss latency is becoming an increasingly important performance 
bottleneck, especially for commercial applications. Although instruction prefetching is an 
attractive technique for tolerating this latency, we find that existing prefetching schemes are 
insufficient for modern superscalar processors, since they fail to issue prefetches early 
enough (particularly for nonsequential accesses). To overcome these limitations, we propose 
a new instruction prefetching technique where ... 

Keywords: compiler optimization, instruction prefetching 
Cache misses are currently a major factor in the cost of garbage collection, and we expect 
them to dominate in the future. Traditional garbage collection algorithms exhibit relatively 
little temporal locality; each live object in the heap is likely to be touched exactly once 
during each garbage collection. We measure two techniques for dealing with this issue: 
prefetch-on-grey, and lazy sweeping. The first of these is new in this context. Lazy sweeping 
has been in common use for a decade. It ... 
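
Lazy sweeping, in its general form, defers the sweep phase and lets the allocator reclaim 
unmarked slots only as it needs them. The sketch below shows that general idea on a 
heap of fixed-size slots. It is an assumption-laden illustration, not the collector 
measured in the paper: the class name, the slot model, and passing the live set 
directly in place of a real mark phase are all hypothetical. 

```python
class LazySweepHeap:
    """Fixed-size-slot heap whose sweep work is done by the allocator."""

    def __init__(self, size):
        self.marked = [False] * size  # mark bits from the last collection
        self.sweep_pos = 0            # next slot the lazy sweeper will examine

    def collect(self, live_slots):
        """Stand-in for the mark phase: record live slots, defer the sweep."""
        self.marked = [i in live_slots for i in range(len(self.marked))]
        self.sweep_pos = 0

    def allocate(self):
        """Sweep forward just far enough to find one reusable slot."""
        while self.sweep_pos < len(self.marked):
            i = self.sweep_pos
            self.sweep_pos += 1
            if self.marked[i]:
                self.marked[i] = False  # survivor: clear its bit and skip it
            else:
                return i                # free or garbage slot: reuse it
        return None  # nothing left to sweep; a real system would collect here


heap = LazySweepHeap(4)
first = [heap.allocate() for _ in range(5)]   # fills slots 0..3, then fails
heap.collect(live_slots={1, 3})               # only slots 1 and 3 survive
second = [heap.allocate() for _ in range(3)]  # reclaims slots 0 and 2 on demand
print(first, second)
```

Each allocation touches only the slots between the previous sweep position and the 
slot it returns, so the sweeping cost is spread across allocations. 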

12 Optimal Code Placement of Embedded Software for Instruction Caches 
Hiroyuki Tomiyama, Hiroto Yasuura 

March 1996 Proceedings of the 1996 European conference on Design and Test 

Full text available: pdf(714.39 KB) Additional Information: full citation, abstract 
Publisher Site 

This paper presents a new code placement method for embedded software to maximize hit 
ratios of instruction caches. We formulate the code placement problem as an integer linear 
programming problem. One of the advantages of our method is that code can be moved 
beyond boundaries of functions, so that code placement is optimized globally. Experimental 
results show our method achieves 35% (max 45%) reduction of cache misses. 
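
To see why code placement affects instruction cache hit ratios at all, the sketch below 
counts misses in a direct-mapped cache for two placements of the same two code blocks. 
It is only an illustration under simplifying assumptions (one code block per cache 
line, a hypothetical 4-line cache, hand-picked addresses); it does not implement the 
integer linear programming formulation described in the abstract. 

```python
def count_misses(trace, placement, num_lines):
    """Count misses in a direct-mapped cache holding one code block per line.

    trace     -- sequence of code block names in execution order
    placement -- maps each block name to its memory block address
    num_lines -- number of lines in the cache
    """
    cache = [None] * num_lines
    misses = 0
    for name in trace:
        addr = placement[name]
        line = addr % num_lines  # direct-mapped: address selects the line
        if cache[line] != addr:
            misses += 1
            cache[line] = addr
    return misses


trace = ["A", "B"] * 4            # a loop that alternates between two blocks
conflicting = {"A": 0, "B": 4}    # both blocks map to line 0 and evict each other
separated = {"A": 0, "B": 1}      # blocks map to different lines

print(count_misses(trace, conflicting, num_lines=4))  # 8
print(count_misses(trace, separated, num_lines=4))    # 2
```

With the conflicting placement every access misses; moving one block to a 
non-conflicting address leaves only the two compulsory misses. 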

Keywords: Code placement, Instruction caches, Embedded software, Integer linear 
programming problem 



13 Multithreading I: Instruction fetch deferral using static slack 
Gregory A. Muthler, David Crowe, Sanjay J. Patel, Steven S. Lumetta 

November 2002 Proceedings of the 35th annual ACM/IEEE international symposium on 
Microarchitecture 

Full text available: pdf Additional Information: full citation, abstract, references, index terms 
Publisher Site 

In this paper we present an approach to boosting performance and tolerating latency by 
deferring non-critical instructions into a deferred queue for later processing. As such, 
instruction deferral allows more critical instructions to be fetched, dispatched, and possibly 
executed, earlier. We present methods for identifying deferrable instructions using previously 
investigated notions of instruction slack. In particular we use static slack to determine if an 
instruction is deferrable. The static s ... 

14 Validated observation and reporting of microscopic performance using Pentium II 
counter facilities 

Haleh Najafzadeh, Seth Chaiken 

January 2004 ACM SIGSOFT Software Engineering Notes, Proceedings of the fourth 
international workshop on Software and performance, volume 29 issue 1 

Full text available: pdf(599.28 KB) Additional Information: full citation, abstract, references 

Microprocessors typically have software readable counters for events such as instruction 
executions, cycles, instruction stalls, and cache misses. Besides their usefulness to report 
overall performance metrics, these counters reveal details about dynamic process behavior 
and hardware affects of compiler optimizations. Our research develops and evaluates, in 
case studies, methodologies to determine just how accurate measurements from counters 
can be. We might then compensate for, reduce and/or es ... 

15 Compiler scheduling: Compiler managed micro-cache bypassing for high performance 
EPIC processors 

Youfeng Wu, Ryan Rakvic, Li-Ling Chen, Chyi-Chang Miao, George Chrysos, Jesse Fang 
November 2002 Proceedings of the 35th annual ACM/IEEE international symposium on 
Microarchitecture 

Advanced microprocessors have been increasing clock rates, well beyond the Gigahertz 
boundary. For such high performance microprocessors, a small and fast data micro cache 
(ucache) is important to overall performance, and proper management of it via load 
bypassing has a significant performance impact. In this paper, we propose and evaluate a 
hardware-software collaborative technique to manage ucache bypassing for EPIC 
processors. The hardware supports the ucache bypassing with a flag in the load ... 

16 The effects of processor architecture on instruction memory traffic 
Chad L. Mitchell, Michael J. Flynn 

August 1990 ACM Transactions on Computer Systems (TOCS), volume 8 issue 3 

Full text available: pdf(1.48 MB) Additional Information: full citation, abstract, references, citings, index 

The relative amount of instruction traffic for two architectures is about the same in the 
presence of a large cache as with no cache. Furthermore, the presence of an intermediate- 
sized cache probably substantially favors the denser architecture. Encoding techniques have 
a much greater impact on instruction traffic than do the differences between instruction set 
families such as stack and register set. However, register set architectures have somewhat 
lower instruction traffic than directly ... 

17 Critical issues regarding HPS, a high performance microarchitecture 
Y. N. Patt, S. W. Melvin, W. M. Hwu, M. C. Shebanow 

December 1985 ACM SIGMICRO Newsletter, Proceedings of the 18th annual workshop 
on Microprogramming, volume 16 issue 4 

Full text available: pdf(987.20 KB) Additional Information: full citation, abstract, references, citings, index 

HPS is a new model for a high performance microarchitecture which is targeted for 
implementing very dissimilar ISP architectures. It derives its performance from executing 
the operations within a restricted window of a program out-of-order, asynchronously, and 
concurrently whenever possible. Before the model can be reduced to an effective working 
implementation of a particular target architecture, several issues need to be resolved. This 
paper discusses these issues, both in general and in ... 

18 Efficient instruction cache simulation and execution profiling with a threaded-code 
interpreter 

Peter S. Magnusson 

December 1997 Proceedings of the 29th conference on Winter simulation 

Full text available: pdf(912.22 KB) Additional Information: full citation, references, index terms 



19 Improving direct-mapped cache performance by the addition of a small fully- 
Norman P. Jouppi 

August 1998 25 years of the international symposia on Computer architecture 
(selected papers) 

Full text available: pdf(1.26 MB) Additional Information: full citation, references, citings, index terms 



20 Compiler-driven cached code compression schemes for embedded ILP processors 
Sergei Y. Larin, Thomas M. Conte 
November 1999 Proceedings of the 32nd annual ACM/IEEE international symposium on 
Microarchitecture 



During the last 15 years, embedded systems have grown in complexity and performance to 
rival desktop systems. The architectures of these systems present unique challenges to 
processor microarchitecture, including instruction encoding and instruction fetch processes. 
This paper presents new techniques for reducing embedded system code size without 
reducing functionality. This approach is to extract the pipeline decoder logic for an 
embedded VLIW processor in software at system develo ... 